The Allotrope Data Format Audit Trail and Electronic Signatures Specification [[ADF-AUDIT]] specifies how to use audit trails in the Allotrope Data Format [[ADF]] in a standardized way. It covers both the Allotrope Audit Trail Ontology and the Audit Trail API that is part of the [[ADF]] APIs.
THESE MATERIALS ARE PROVIDED "AS IS" AND ALLOTROPE EXPRESSLY DISCLAIMS ALL WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, INCLUDING, WITHOUT LIMITATION, THE WARRANTIES OF NON-INFRINGEMENT, TITLE, MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.
This document is part of a set of specifications on the Allotrope Framework [[!AF]].
Within this specification, the following standard namespace prefix bindings are used:
Prefix | Namespace |
---|---|
owl: | http://www.w3.org/2002/07/owl# |
rdf: | http://www.w3.org/1999/02/22-rdf-syntax-ns# |
rdfs: | http://www.w3.org/2000/01/rdf-schema# |
xsd: | http://www.w3.org/2001/XMLSchema# |
dct: | http://purl.org/dc/terms/ |
skos: | http://www.w3.org/2004/02/skos/core# |
foaf: | http://xmlns.com/foaf/0.1/ |
org: | http://www.w3.org/ns/org# |
prov: | http://www.w3.org/ns/prov# |
pav: | http://purl.org/pav/ |
time: | http://www.w3.org/2006/time# |
ore: | http://www.openarchives.org/ore/terms/ |
sh: | http://www.w3.org/ns/shacl# |
hdf: | http://purl.allotrope.org/ontologies/hdf/1.8# |
qb: | http://purl.org/linked-data/cube# |
af-c: | http://purl.allotrope.org/ontologies/common# |
af-cq: | http://purl.allotrope.org/ontologies/common/qualifier# |
af-m: | http://purl.allotrope.org/ontologies/material# |
af-e: | http://purl.allotrope.org/ontologies/equipment# |
af-p: | http://purl.allotrope.org/ontologies/process# |
af-r: | http://purl.allotrope.org/ontologies/result# |
adf-a: | http://purl.allotrope.org/ontologies/audit# |
adf-g: | http://purl.allotrope.org/ontologies/graph# |
adf-dc: | http://purl.allotrope.org/ontologies/datacube# |
adf-dp: | http://purl.allotrope.org/ontologies/datapackage# |
adf-dc-hdf: | http://purl.allotrope.org/ontologies/datacube-hdf-map |
Within the examples, the following namespace prefix bindings are used:
Prefix | Namespace | Description |
---|---|---|
ex: | http://example.org/ | an example namespace |
ex1: | http://example.org/ | an example namespace |
ex2: | http://example.org/ | an example namespace |
The Allotrope Foundation Taxonomies use identifiers (IRI) that are not human-readable. In order to make the examples more understandable, we use the following notation: we prefix the SKOS [[skos-reference]] preferred label of the concept with namespace prefix and surround it with French quotes (guillemets):
«{ns-prefix}:{pref label}», where {ns-prefix} is the namespace prefix defined above and {pref label} is the preferred label of the concept defined in the taxonomy.
For example, the notation «af-x:device manufacturer» represents the real IRI af-x:AFX_0000333 of the device manufacturer property.
This document uses the Unified Modeling Language (UML) to illustrate some concepts and to visualize RDF graphs. These diagrams are non-normative and should not be read with the strict semantics defined by the UML specification [[UML]].
Colors are used to make the domain of the entities more transparent. The color schema is illustrated in the following figure:
Within this document, decimal numbers will use a dot "." as the decimal mark.
The audit trail ensures traceability of changes to the contents of the file, independent of any IT application or system, making the file a standalone representation of the content and its audit trail (user, timestamp, entity changed, old value, new value, reason), and standardizing this aspect of compliance across Allotrope-enabled software. Here, the responsibility for the audit trail is owned by the Allotrope class libraries. For each ADF API, we MUST track who did what, when, why, and how. *What* describes the changed values, either directly or via references; *who* describes the primary agent (person or organization) that made the change; *when* describes the time interval in which the changes were made; *why* describes the motivation for the change; and *how* describes the software that applied the ADF API.
An Audit trail created by an external software application can be added to the ADF, wherein the responsibility of the audit trail is owned by the external software application. It must be possible for external applications to add audit trail information to the ADF in order to document the complete audit history in one location.
The Allotrope audit trail and electronic signature concepts are all part of the Allotrope Audit ontology with the public IRI <http://purl.allotrope.org/ontologies/audit>. The ontology depends on the following external ontologies:
An audit trail is a chronological record of system activities. A single record of a system activity is called an audit record dataset. It is represented in ADF by a dataset, which is a single named graph (this graph may further define or reference other named graphs). The content of the audit record dataset varies depending on the kind of activity and the electronic record that is affected by the activity.
An audit trail is an ordered aggregation (list) of audit record datasets. 1-n relations between instances in an RDF graph are not ordered, so an intermediate proxy object is needed that gives the audit record dataset an explicit order. We use the ORE model of aggregations:
The amount and kind of metadata and data for describing the changes vary with the type of activity that has been done on the ADF file. All of this is stored together in an RDF dataset consisting of a primary named graph (audit record graph) for the audit record and, optionally, other named graphs that describe additions and removals from the default RDF model under audit – the data description. The audit record graph contains copies of, or references to, the data at the time before the change and after, as well as other metadata about the change, such as who?, why?, when?, etc.
Strictly speaking, the 'audit record' is only the proxy for the 'audit record dataset' (see above), but otherwise the two terms are synonyms.
The audit trail ontology leverages the PROV-O ontology as a foundation for the metadata. The PROV-O core model is about three classes:
An entity – in our case, the ADF file – was generated by some activity (e.g. a measurement); later, new versions are derived from previous ones. The activity is associated with some agent – a person, an organization, or a software system – to which the entity is attributed. Other activities may use the entity. The audit trail of the ADF file is currently only concerned with the derivation of the ADF file; the initial generation and possible invalidation of the ADF file are typically not traced by an audit trail, but it is possible to do so.
The Provenance Ontology distinguishes between three kinds of derivation: revision, quotation, and primary source.
Only the revision is used in the audit trail, so that we have three types of audit record datasets:
This section describes the content of the revision dataset for an ADF file in detail.
A revision creates a new version of an entity. In our case, the entity is an electronic record, the ADF file. Each transaction generates a new version IRI for the electronic record – see versioning.
Within one audit record, there MUST be only one revision, but that revision may contain multiple change sets. The new version of the ADF file is generated by a single activity and uses the existing version.
The new version is the revision of the old version. The activity metadata includes the time and optionally the physical location. The start and end time can be stated explicitly with a date/time literal, or another entity can be stated that acts as a trigger starting or ending the activity.
# version 1 is revision of version 0
<adf://self/version/1> prov:wasRevisionOf <adf://self/version/0> .
# the activity generated v1 and used v0, Dublin Core title and description can be used to describe the activity
<adf://audit/auditrecord/1/activity> a prov:Activity;
dct:title "Revision 1";
dct:description "A new version of the ADF file is created that ...";
prov:generated <adf://self/version/1> ;
prov:used <adf://self/version/0> ;
prov:startedAtTime "2016-11-30T16:45:00Z"^^xsd:dateTime ;
prov:endedAtTime "2016-11-30T17:15:00Z"^^xsd:dateTime ;
prov:wasStartedBy <http://example.org/process/...> ; # example of an external trigger reference
prov:atLocation <http://example.org/site1/room101> .
The revision has been done by an agent; usually a person, but it could also be a software system. The agent is associated with the revised entity via the attribution, which also captures an optional role. There are several standard role classes defined in the audit trail vocabulary, and they SHOULD be used if applicable, but any role defined in the standard Allotrope ontologies [[AFO]] can be used as well. What metadata must be stated about the agent to be compliant with regulatory and privacy protection rules is out of scope of this specification, but any detailed information MUST use the FOAF and ORG vocabularies where applicable. The PROV-O vocabulary also allows stating a delegation. The ORG ontology allows a fine-grained description of the organizational structure and any posts and roles therein.
<adf://audit/auditrecord/1/attribution> a prov:Attribution;
prov:hadRole adf-a:Approver;
prov:agent <mailto:jerry@example.org> ;
prov:actedOnBehalfOf <mailto:tom@example.org> .
<mailto:jerry@example.org> a prov:Person, foaf:Person;
foaf:mbox <mailto:jerry@example.org> ;
foaf:familyName "Mouse" ;
foaf:givenName "Jerry" ;
org:memberOf <http://example.org/unit1> .
<mailto:tom@example.org> a prov:Person, foaf:Person;
foaf:mbox <mailto:tom@example.org> ;
foaf:familyName "Cat" ;
foaf:givenName "Tom" ;
org:memberOf <http://example.org/unit1> ;
org:holds <http://example.org/unit1/qa> .
<http://example.org/unit1/qa> a org:Post ;
dct:title "Quality Assurance Unit 1" .
<http://example.org/unit1> a org:OrganizationalUnit ;
dct:title "Unit 1" ;
org:post <http://example.org/unit1/qa> ;
org:unitOf <http://example.org> .
<http://example.org> a org:FormalOrganization ;
dct:title "Example Organization" .
The details of changes to the electronic record made in the revision, and their motivations, are tracked in changesets. An electronic record is often composed of parts that can be added to, removed from, or updated in the electronic record. These parts are often composed of other parts, and so on. An ADF file is composed of the data description, the data cube, and the data package parts. A changeset can be applied to the whole electronic record under revision, or only to one of these parts – the subject of change.

In many cases, the change can be described by the set of removed parts combined with the set of added parts. The new version of the electronic record is created from the original one and the changeset by first removing all the parts of the original electronic record listed in the removal set of the changeset, followed by adding the parts in the addition set.

If the change cannot be decomposed into separate parts of the electronic record, but affects data segments that have no identity of their own, then a data update is needed in the changeset. The data update then either lists the old and new data, or it describes the position of the old and new data. For example, a file in the data package cannot be further decomposed into smaller parts. So if the file content – a binary stream of data – is changed, either the whole file must be exchanged, or the data update must be described in detail. For files, the changed data can be described by its size and position within the binary stream. For data updates in data cubes, the parts that have been changed within the cube are described by data selections (slabs or point selections).
An ADF file is composed of three major parts
adf://dd
adf://dc
adf://dp
A change in the data description is always described by a set of removed statements and a set of added statements.
Assuming a resource ex:example has its title changed from "old title" to "new title", then within the data description the old statement
ex:example dct:title "old title" .
will be replaced with the new statement
ex:example dct:title "new title" .
In order to describe this change with a changeset, two named graphs are created: one that contains all removed statements (<removal-graph>), and another containing the added statements (<addition-graph>).
The changed data description is created from the original one by the following set operation:
<dd-new> = (<dd-old> − <removal-graph>) ∪ <addition-graph>
Symbol | Meaning |
---|---|
= | set equality |
− | set difference |
∪ | set union |
The original data description can be recovered by the inverse operation:
<dd-old> = (<dd-new> − <addition-graph>) ∪ <removal-graph>
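The set operations above can be sketched with plain Python sets of (subject, predicate, object) tuples. This is only an illustration – the resource names are made up and a real implementation operates on an RDF store:

```python
# A triple is a (subject, predicate, object) tuple; a graph is a set of triples.
dd_old = {("ex:example", "dct:title", "old title"),
          ("ex:example", "rdf:type", "ex:Document")}
removal_graph = {("ex:example", "dct:title", "old title")}
addition_graph = {("ex:example", "dct:title", "new title")}

# Apply the changeset: <dd-new> = (<dd-old> - <removal-graph>) | <addition-graph>
dd_new = (dd_old - removal_graph) | addition_graph

# Revert it:           <dd-old> = (<dd-new> - <addition-graph>) | <removal-graph>
dd_reverted = (dd_new - addition_graph) | removal_graph
assert dd_reverted == dd_old
```

The same pair of graphs thus serves both to apply and to undo the change, which is what makes the changeset model auditable.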
In general, the above method will only work if the resources are not anonymous. Anonymous resources (blank nodes) have only a local identity within a graph; only a URI gives a resource a public identity.
Typically, serializations of the RDF graph, such as N3, generate new blank node IDs on each output, like _:b1. The consequences for the audit trail are best explained with an example.
What happens if the following statements are added to the data description in Turtle notation?
ex:example1 ex:hasTopic [ dct:title "about As"] .
ex:example2 ex:hasTopic [ dct:title "about As"] .
Internally, this will be represented in triple form:
ex:example1 ex:hasTopic _:b1 .
_:b1 dct:title "about As" .
ex:example2 ex:hasTopic _:b2 .
_:b2 dct:title "about As" .
with _:b1 and _:b2 being representations of the internal identifiers for the two blank nodes. Both graphs are fully isomorphic to each other, and so would be
ex:example1 ex:hasTopic _:b10 .
_:b10 dct:title "about As" .
ex:example2 ex:hasTopic _:b20 .
_:b20 dct:title "about As" .
Now, if we change the topic of ex:example2 to [ dct:title "about Bs" ], we get the removal graph
<removal-graph> {
_:b2 dct:title "about As" .
}
and the addition graph is
<addition-graph> {
_:b2 dct:title "about Bs" .
}
However, _:b2 is only known in the original data description graph. The _:b2 in the removal or addition graph can have any other internal blank node identifier, for example _:b1:
<removal-graph> {
_:b1 dct:title "about As" .
}
In any case, the information in the audit record named graphs about the original blank node identifiers is lost. The equivalent Turtle representation
<removal-graph> {
[ dct:title "about As" ] .
}
better reflects this. In ADF, we ensure that the internal identifiers of blank nodes are kept the same as in the original data description when the audit trail addition/removal graphs are created, so that the change location can be identified. However, any external representation of the audit record will not have this information available. A fully repeatable audit trail must make the internal identifiers of all blank nodes explicit in an external representation of the graph. External changes must keep track of the naming (skolemization) when making changes:
ex:example1 ex:hasTopic _:b10 .
_:b10 dct:title "about As" .
ex:example2 ex:hasTopic _:b20 .
_:b20 dct:title "about As" .
# following is explicit blank node labeling
_:b10 adf-a:blankNodeId 'a8735c8d-cb51-4a76-813e-da85abb23827' . # giving the blank node _:b10 an external unique id
_:b20 adf-a:blankNodeId '4cd2d444-8333-42d2-8bd1-02b8f006a246' . # giving the blank node _:b20 an external unique id
In this case, the portable removal graph would look as follows (note the different blank node label _:b1 instead of _:b10):
<removal-graph> {
_:b1 dct:title "about As" .
_:b1 adf-a:blankNodeId 'a8735c8d-cb51-4a76-813e-da85abb23827' .
}
The labeling statements are metadata and not a real part of the graph's content, and could be put into a separate named graph. However, in that case, they can easily get lost.
An alternative is to not make adf-a:blankNodeId explicit; in that case, the ex:example1 ex:hasTopic _:b10 statement itself provides the identity for the blank node _:b10. This works well except in some obscure, artificial border cases.
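The explicit blank node labeling can be sketched in Python. The set-of-triples representation and the skolemize helper are hypothetical illustrations; adf-a:blankNodeId is the labeling property from the audit ontology:

```python
import uuid

# A graph as a set of (subject, predicate, object) triples; blank nodes
# are the terms whose label starts with "_:". Illustrative only.
graph = {("ex:example1", "ex:hasTopic", "_:b10"),
         ("_:b10", "dct:title", "about As"),
         ("ex:example2", "ex:hasTopic", "_:b20"),
         ("_:b20", "dct:title", "about As")}

def skolemize(graph):
    """Add an adf-a:blankNodeId statement with a fresh UUID for every blank node."""
    bnodes = {term for (s, p, o) in graph for term in (s, o)
              if term.startswith("_:")}
    labeled = set(graph)
    for b in sorted(bnodes):
        labeled.add((b, "adf-a:blankNodeId", str(uuid.uuid4())))
    return labeled

labeled = skolemize(graph)
# every blank node now carries an external, serialization-independent id
```

With these labels in place, an external consumer can correlate a blank node in a removal or addition graph with the blank node in the data description, regardless of the internal labels each serialization assigns.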
Since the data description in newer ADF files can also make use of named graphs, the addition, removal, and update of named graphs needs to be tracked in the audit trail as well. With multiple named graphs, we are no longer adding or removing statements (triples), but effectively quads that, besides the subject, predicate, and object, also contain the owning graph. The audit trail model, however, does not describe changes in the form of adding and removing quads to the data description directly; instead, it tracks the addition and removal of graphs to the data description, and the updates of them. In this model, the addition or removal of a statement (triple) is a DataUpdate of a named graph. This is similar to how updates on data cubes or data package nodes are handled: the target of the data update is the named graph, and the sets of added and removed statements are used to describe the old and new data.
<changeset-dd> a adf-a:ChangeSet;
adf-a:addition <added-named-graph>;
adf-a:removal <removed-named-graph>;
adf-a:update <named-graph-update>;
adf-a:subjectOfChange <adf://dd> .
<added-named-graph> a void:Dataset, adf-g:Graph .
<removed-named-graph> a void:Dataset, adf-g:Graph .
<named-graph-update> a adf-a:DataUpdate;
adf-a:target <updated-named-graph>; # update on named graph
adf-a:newData <added-to-named-graph>; # new data = added triples stored in a named graph
adf-a:oldData <removed-from-named-graph> . # old data = removed triples stored in a named graph
<added-to-named-graph> a void:Dataset, adf-g:Graph .
<removed-from-named-graph> a void:Dataset, adf-g:Graph .
# named graph with the added statements
<added-to-named-graph> {
ex:example dct:title "new title" .
}
# named graph with the removed statements
<removed-from-named-graph> {
ex:example dct:title "old title" .
}
A changeset in the data cube part of the ADF file contains the added and removed data cubes, and any data updates in the cube.
<changeset-dc> a adf-a:ChangeSet;
adf-a:addition <adf://dc/added-cube>;
adf-a:removal <adf://dc/removed-cube>;
adf-a:update <adf://dc/updated-cube-data>;
adf-a:subjectOfChange <adf://dc> .
<adf://dc/added-cube> a qb:DataSet ;
qb:structure ...
# the removed data cube gets a link to the backup archive (not implemented)
<adf://dc/removed-cube> a qb:DataSet ;
qb:structure ...
adf-a:archivedTo <adf://dc/archive/removed-cube> .
# the cube will be moved in an archive section of the ADF file (not implemented)
<adf://audit/changes/dc/removed-cube> a qb:DataSet ;
qb:structure ...
adf-a:isArchiveOf <adf://dc/removed-cube> .
<adf://dc/updated-cube> a qb:DataSet ;
qb:structure <update-cube-structure> ;
# the archived data cube containing the overridden data will be put in an archive section (not implemented)
<adf://audit/changes/dc/update-cube> a qb:DataSet ;
qb:structure <archive-cube-structure> ;
<adf://dc/updated-cube-data> a adf-a:DataUpdate ;
adf-a:target <adf://dc/updated-cube> ;
adf-a:oldDataReference <selection-old> ;
adf-a:newDataReference <selection-new> .
<selection-old> a adf-dc:DataSelection ;
adf-dc:selectionOn <adf://audit/changes/dc/update-cube> ;
...
<selection-new> a adf-dc:DataSelection ;
adf-dc:selectionOn <adf://dc/updated-cube> ;
...
Appended data does not need to be stored in an archive, but can be referenced directly on the current version of the cube; however, any data that is deleted or overwritten must be stored in an archive data cube.

A data selection creates a stream of observation data in the order of the cube dimensions. Thus, it is not necessary for the archive data cube to have the same structure as the updated/deleted data cube. A single-dimension data cube is sufficient to store the changed data. It is even possible and efficient to use a single archive data cube to store multiple data updates:
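The shared archive cube idea can be sketched as follows. The one-dimensional archive is modeled as a Python list, and the offset/length bookkeeping stands in for the data selections on the archive cube; all names are illustrative:

```python
archive = []   # the single-dimension archive data cube
updates = []   # bookkeeping: one (offset, length) entry per DataUpdate

def archive_update(old_values):
    """Append the overwritten values (streamed in cube-dimension order)
    to the shared archive cube and record where they were stored."""
    offset = len(archive)
    archive.extend(old_values)
    updates.append({"offset": offset, "length": len(old_values)})
    return updates[-1]

u1 = archive_update([1.0, 2.0, 3.0])   # first overwrite archives 3 values
u2 = archive_update([9.5, 8.5])        # second overwrite archives 2 values
assert archive == [1.0, 2.0, 3.0, 9.5, 8.5]
assert u2 == {"offset": 3, "length": 2}
```

Each data update's old data is then a contiguous selection (offset, length) on the single archive cube, so no per-update archive cube is needed.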
This audit trail specification currently defines no policy on what the version IRI for the ADF file should look like; it is simply a reference point for tracking the state. However, the audit trail API uses the org.allotrope.audit.service.VersioningService service for generating version IRIs. Its implementation applies the following canonical versioning scheme, following a RESTful, linked data approach: the version IRI is formed from the resource IRI by appending /version/ to it, followed by either a version string (if defined), or a timestamp-based version string if not.
The timestamp-based version string is the ISO 8601 date-time format string, in UTC, of the time the version is created. A version IRI for a resource created on November 30th, 2016 at 11:45 a.m. UTC using timestamp-based versioning would be http://example.org/resource/version/2016-11-30T11:45:00, where http://example.org/resource is the base URL of the resource.
The current versioning service implementation uses a version string that is an incremented integer. The provenance and the version strings are described using the PAV vocabulary.
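The two versioning schemes can be sketched in Python. The actual VersioningService is a Java service, so the version_iri function below is a hypothetical illustration of the scheme, not its implementation:

```python
from datetime import datetime, timezone

def version_iri(resource_iri, version=None, now=None):
    """Append /version/ plus an explicit version string, or a UTC
    ISO 8601 timestamp if no version string is given."""
    if version is None:
        now = now or datetime.now(timezone.utc)
        version = now.strftime("%Y-%m-%dT%H:%M:%S")
    return f"{resource_iri}/version/{version}"

# timestamp-based versioning, matching the example in the text
ts = datetime(2016, 11, 30, 11, 45, 0, tzinfo=timezone.utc)
assert (version_iri("http://example.org/resource", now=ts)
        == "http://example.org/resource/version/2016-11-30T11:45:00")

# integer version strings, as used by the current implementation
assert version_iri("adf://self", version="1") == "adf://self/version/1"
```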
The initial version of the ADF file has the (locally resolvable) IRI adf://self/version/0. adf://self/ is the local ADF URL of the ADF file, which is an alias for its public UUID URN.
The version metadata is
<adf://self> pav:hasVersion <adf://self/version/0> ;
pav:currentVersion <adf://self/version/0> .
<adf://self/version/0> pav:hasVersion "0" .
The next version IRI becomes adf://self/version/1.
Its version metadata is
<adf://self> pav:hasVersion <adf://self/version/0>, <adf://self/version/1> ;
pav:currentVersion <adf://self/version/1> .
<adf://self/version/1> pav:hasVersion "1" ;
pav:previousVersion <adf://self/version/0> .
In most cases, resources within an ADF file get a UUID URN. UUID URNs have the advantage that an algorithm for generating UUIDs is specified that can be executed without access to any central authority ensuring the uniqueness of the URN. However, UUIDs are not an ideal solution for IRIs if we want to follow the linked data approach in which IRIs should be resolvable. UUIDs are not resolvable, so if one wants to access a resource such as a file in the data package by its UUID URN, a mapping of UUIDs to the files in the data package must be kept somewhere. A much more natural URI for the file would be its path in the data package, which is unique (within the ADF file) and allows finding the file directly, without any additional mapping, simply by traversing the path to its location. This is how file:// and http:// URLs work. The linked data approach recommends using http-URLs as identifiers, but we cannot follow that approach in an ADF file because there is no authoritative HTTP server that could resolve an http-URL addressing something within an ADF file.
To make resources in an ADF file resolvable, we define a locally scoped custom URL scheme adf:// that can be resolved only within an ADF file. This URL is a local alias for a resource that is only valid in the context of the ADF file. To make it globally unique, the local URL must be transformed into a public URI, either by giving it an explicit public URI such as a UUID URN or a public http-URL, or by combining the public URL of the ADF file with the locally resolved adf-URL.
Accessing parts of an HTML or XML document with a file- or http-URL can be done through anchor fragments, or using the XPointer specification [[xptr-xpointer]], [[RFC3986]]. A similar approach can be used to make ADF URLs public:
For example, take http://example.org/example.adf#adf://dd. The local URL adf://dd is the fragment accessor into the data description part of the ADF file accessible under http://example.org/example.adf.
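Composing and splitting such public URLs is simple string handling; the helper names below are illustrative, not part of any ADF API:

```python
def to_public(adf_file_url, local_adf_url):
    """Combine the public URL of the ADF file with a local adf:// URL
    using the fragment scheme described above."""
    return f"{adf_file_url}#{local_adf_url}"

def split_public(public_url):
    """Split a public URL back into the ADF file URL and the local adf-URL."""
    adf_file_url, _, local = public_url.partition("#")
    return adf_file_url, local

public = to_public("http://example.org/example.adf", "adf://dd")
assert public == "http://example.org/example.adf#adf://dd"
assert split_public(public) == ("http://example.org/example.adf", "adf://dd")
```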
The following ADF URLs are defined and reserved:
URL | Description |
---|---|
adf://self | A self reference that is an alias for the ADF file (the container) itself. |
adf://dd | The URL of the data description. This URL is a named graph IRI for the triples stored in the data description. |
adf://dc | This URL is the base URL for all data cubes defined in the ADF file. |
adf://dp | This URL is the base URL for the files and folders stored in the data package of the ADF file. Files and folders in the data package have local ADF URLs by appending their path to this base URL. |
adf://audit | This URL is the base URL for audit trail URLs, such as the audit trail of the ADF file, or any audit records that are part of it. |
adf://dp/folder1/folder2/file1 is a URL for a file with the path folder1/folder2 and the name file1 in the data package of the ADF file.
Using these locally resolved ADF URLs, the resulting RDF triples are much easier to read and, if RESTful principles of URL design are followed, they can in many places be self-documenting. UUID URNs hide any details of the type of resource, and are very difficult to validate in larger graphs, so use of local ADF URLs is recommended.
Care has to be taken when resource IRIs are exported from or imported into an ADF file. An ADF URL MUST never be used outside of an ADF file. If a resource also has a public IRI, an explicit equivalence mapping using owl:sameAs between the public and local IRI SHOULD be added.
<adf://self> dct:hasPart <adf://dd>, <adf://dc>, <adf://dp> .
Similar to the logical resources in an ADF file, the underlying implementations (representations) in HDF can also be locally resolved. For accessing HDF resources, we use the custom URL scheme hdf://. These HDF URLs can be used to address all HDF named objects, which include HDF groups, datasets, and named datatypes. An HDF named object is identified by its path. hdf:// is the root group of the HDF file; hdf://a/b/c is a named object in the nested group /a/b. This URL does not tell whether c is a group, a dataset, or a datatype.
An extension of this URL scheme to typed HDF URLs that also indicate the type of the resource could be to add a query [[RFC3986]] with a type parameter, for example ?@type=${typeIRI}, to the URL. The parameter name @type follows the JSON-LD convention. <hdf://a/b/c?@type=hdf:Dataset> would be the typed URL for an HDF dataset with the HDF path /a/b/c.
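Building such a typed HDF URL can be sketched as follows; typed_hdf_url is a hypothetical helper for the proposed extension, not part of any ADF API:

```python
from urllib.parse import quote

def typed_hdf_url(path, type_iri=None):
    """Build an hdf:// URL for a named object, optionally appending the
    proposed @type query parameter with the (percent-encoded) type IRI."""
    url = "hdf://" + path.lstrip("/")
    if type_iri:
        url += "?@type=" + quote(type_iri, safe=":")
    return url

assert typed_hdf_url("/a/b/c") == "hdf://a/b/c"
assert typed_hdf_url("/a/b/c", "hdf:Dataset") == "hdf://a/b/c?@type=hdf:Dataset"
```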
The information about DpFile revisions is collected by iterating over all audit records. To improve performance, all results are added to the audit record dataset. The graph contains the resource of a revised DpFile pointing to a revision resource of its previous revision.
Version | Release Date | Remarks |
---|---|---|
0.1.0 | 2016-12-07 | |
0.2.0 | 2017-03-31 | |
1.3.0 | 2017-06-30 | |
1.4.3 RC | 2018-10-11 | |
1.4.5 RF | 2018-12-17 | |
1.5.0 RC | 2019-12-12 | |
1.5.0 RF | 2020-03-03 | |
1.5.3 RF | 2020-11-30 | |